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Abstract 
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i We propose an £i-penalized estimation procedure for high-dimensional lin- 

ear mixed-effects models. The models are useful whenever there is a grouping 
structure among high-dimensional observations, i.e. for clustered data. We 
prove a consistency and an oracle optimality result and we develop an al- 
gorithm with provable numerical convergence. Furthermore, we demonstrate 
the performance of the method on simulated and a real high-dimensional data 
set. 
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1 Introduction 

(N 

1.1 High-dimensional statistical inference: some known re- 
i ' suits for convex loss functions 
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Substantial progress has been achieved over the last decade in high-dimensional 
statistical inference where the number of parameters p is allowed to be of much 
larger order than sample size n. To fix ideas, suppose we focus on estimation 
of a p-dimensional parameter (3q based on n noisy observations where p S> n. 
Although such a problem is ill-posed in general, it can be accurately solved if the 
underlying true structure of (3q is sparse. Here, sparsity may be measured in terms 
of the ^ r -norm ||/3|| r = Ej=i IA?'| r ) (0 < r < oo). Very roughly speaking, high- 
dimensional statistical inference is possible, in the sense of leading to reasonable 
accuracy or asymptotic consistency, if 

log(p) • sparsity(/3 ) a < n, 

where typically a = 2 (cf. formula (TT])) or a — 1 (cf. formula ([2])). and assuming 
that the underlying (e.g. regression) design behaves reasonably. 
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ulating discussions. 
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A lot of attention has been devoted to high-dimensional linear models 



y = X(3 + e, 



with n x p design matrix X and p 3> n. A very popula r and powerful estima- 
tion method is the Lasso, proposed by iTibshirani (1996). It is an acronym for 
Least Absolute Shrinkage and Selection Operator and the name is indicating that 
the method is doing some variable selection in the sense that some of the regres- 
sion coefficient estimates are exactly zero. Among the main reasons why it has 
become very popular for high-dimensional estimation problems are its statistical 
accuracy for prediction and variable selection coupled with its computational fea- 
sibility which involves convex optimization only. The latter is in sharp contrast to 
exhaustive variable selection based on least squares estimation whose computational 
complexity is in general exponential in p. The statistical properties of the Lasso 
in high-dimensional settings have been worked out in numerous articles. Without 
(essentially) a condition on the design X, the Lasso satisfies: 



\X({3 - (3 )\\ 2 2 /n = Op(||/3 o ||i0og(p)/n) 



(1) 



where Op(-) is with respect to p > n — > oo (jBuhhnann and van de Geerl . l201lh . 
That is, if the model is sparse with \\(3q\\i <C ^/log(p)/re, we obt a in con sistency. 



Such kind of a result has been proved bv iGreenshtein and Ritov ( 2004) . Later, 
optimality has been established where (J]} is improved to 



11X09 - MWl/n = O P (s r 2 log(p)/n), 



and furthermore 



||/3 - 0o\\r = P ( S l /r r 2 V^g(p)/n), r e {1, 2}, 



(2) 



where sq equals the number of no n-zero coefficients and £ 2 denotes a restricted 
eigenvalue of the design matrix X (jBuhlmann and van de Geerl . 120111 ). The rate 
in ([2|) is optimal up to the log(p) factor and the restricted eigenvalue £ 2 : oracle 
least squares estimation where the relevant variables would be known would have 
rate Op(so/n). We emphasize that for obtaining optimal convergence rates as 
in ©, we need to make some assumptions on the design that £ 2 is not getting 
too small as p > n — > oo, something we do not require i n Q. Works dealin g 
with various aspects aro u nd (|H include iBunea et all (120071) . Ivan de Geer (2008), 
Zhang and rluand (|2008l ). IMeinshausen and Yul (|2009h and lBickel et al.l (|2009l ) 
A quite different problem is variable selection for inferring the true underlying active 
set Sq — { 1 < k < p : (3 Q ,k ^ 0}. A simple e stimator is5 , = |l<fc<jp : $k ^ 
0} where no significance testing is involved. IMeinshausen and Buhlmannl (|2006( ) 
show for the Lasso that under the so-called neighborhood stability condition for the 
design, the Lasso does consistent variable selection in the sense that 



V[S = S ] -> 1 (p > n 



(3) 



assuming that the non-zero coefficients in Sq are sufficiently large in absolute value, 
e.g. minfegSo \0o,k\ ^ s o^ 2 \/^ o s(p)/ n which is the rate in (0) for r = 1. The 
ne ighborhood st a bility condition is equivalent to the irrepresentable condition used 



Zhao and Yul (|2006l ). and they are both sufficient and (essentially) necessary 



for consistent model selection as in (J3|). Unfortunately, the neighborhood stability 
and the irrepresentable condition are rather restrictive and many designs X would 
violate them. In case of (weaker) restrictive eigenvalue conditions, one still has the 
variable screening property for the Lasso 



¥[S D So] -> 1 (p >n^oo), 



(4) 
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again assuming that the non-zero coefficients in So are sufficiently large in ab- 
solute value. Formula (|4]) says that the Lasso does not miss a relevant variable 
from So; in addition, for the Lasso, the cardinality \S\ < min(n,p) and hence, for 
p n, we achieve a hu ge dimensionality reduction in (jj). The adaptive Lasso, 
proposed by IZou ( 2006 ) is a two-stage method which achieves © under weaker 



restrictive eigenvalue assumpt ion than the irrepresentable condition (jHuang et al 



2008 : van de Geer et al. . 2010h . We summarize the basic facts in Table [3 Moreover, 



Table 1: Properties of the Lasso and required conditions to achieve them 



property 


design condition 


size of non-zero coeff. 


consistency as in (CD 


no requirement 


no requirement 


fast convergence rate as in J21 


restricted eigenvalue 


no requirement 


variable selection as in (J3j 


neighborhood stability 
<^ irrepresentable cond. 


sufficiently large 


variable screening as in @ 


restricted eigenvalue 


sufficiently large 



Restricted eigenvalue assumption is weaker than the neighborhood stability or irrepresentable 
condition (van dc Geer and Biihlmann, 2009). For the adaptive Lasso: variable selection as in {3j 
can be achieved under restricted eigenvalue conditions. 



everything essentially holds in an analogous way when using the Lasso in generalized 
linea r models, i.e. £i-norm pe nalization of t he neg ative log-likelihood ([van de Geer . 
2008). Finally, we note that iBickel et al.l (I2009 | ) prove equivalent th eoretical be- 
haviour of the Lasso and the Dantzig selector (jCandes and Tad . 120071 ) in terms of 
([2|), exemplifying that properties like ([2]) hold for other estimators than the Lasso 
as well. 



Having some variable screening property as in Q, we can reduce the false pos- 
itive selections by various meth ods, besides the adaptive Lasso me ntioned above, 
including also stability selection (Meinshauscn and Biihlmann, 2010 ) based on sub- 



sampling or via assigning p- values ([Wasserman and Roedeii 2009; Me inshausen et al 
2009) based on sample splitting. 

Regarding computation, the Lasso involv es convex optim i zation . Popular algo 



rithms are based o n the homotopy method (jOsborne et all l2000h such as LARS 
(lEfron et all 2004). More recently, it has been argued that the coordinate gradi- 



ent descent approach is typ ically more efficient ( Meier et al. . 20081 Wu and Lange , 



20081 iFriedman et all [2010 ) . 



1.2 High-dimensional linear mixed-effects models with non- 
convex loss function 

The underlying assumption that all observations are inde pendent is not always 
appropriate. We consider here linear mixed-effects models (Laird and Ward. Il982t 



Pinheiro and Bates . 2000; IVerbeke and Molenbergh£ [ |2000tfcemidenkol . l2004li where 



high-dimensional data incorporates a grouping structure with independent obser- 
vations between and dependence within groups. Mixed-effects models, including 
random besides fixed effects, are a popular extension of linear models in that di- 
rection. For example, many applications concern longitudinal data where the ran- 
dom effects vary between groups and thereby induce a dependence structure within 
groups. It is a crucial and important question how to cope with high-dimensional 
linear mixed-effects models. Surprisingly, for this problem, there is no established 
procedure which is well understood in terms of statistical properties. 

The main difficulty arises from non-convexity of the negative log-likelihood func- 
tion which makes computation and theory very challenging. We are presenting some 
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methodology, computation and theory for £i-norm penalized maximum likelihood 
estimation in linear mixed-effects models where the number of fixed effects may be 
much larger than the overall sample size but the number of covariance parameters 
of the random effects part being small. Based on a framew ork for li-penalizati on of 
smooth but non-convex negative log- likelihood functions ( Stadler et al. . 201dl ). we 



develop in Section [3] analogues of ((J), © and (gj), see also Table [JJ and some prop- 
erties of an adaptively ^i-penalized estimator. In our view, these are the key prop- 
erties in high-dimensional statistical inference in any kind of model. For example, 
with ((3]) at hand , p- values for single fixed - effects coefficients could be constructed 
along the lines of lMeinshausen et al. ( 2009f ) , controlling the familywise error or false 



discovery rate (but we do not apply such a method in this paper). Furthermore, 
we design in Section 0] an efficient coordinate gradient descent algorithm for linear 
mixed-effects models which is proved to converge numerically to a stationary point 
of the corresponding non-convex optimization problem. 

We remark that we focus here on the case where it is pre-specified which covari- 
ates are modelled with a random effect and which are not. In some situations, this 
is fairly realistic: e.g., a random intercept model is quite popular and often leads to 
a reasonable model fit. Without pre-specification of the covariates having a random 
effect, one could do variable selection based on penalized likelihood approaches on 
the level of random effects: t his has been develop ed fr om a methodo l ogical and 
computational perspective by iBondell et al.l (l2010h and llbrahim et aD (l2010h for 



low-dimensional settings. Addressing such problems in the truly high-dimensional 
scenario is beyond the scope of this paper. However, we present in Section [6] a 
real high-dimensional data problem where some exploratory analysis is used for 
deciding which covariates are to be modelled with a random effect. This example 
also illustrates empirically that there is a striking improvement if we incorporate 
random effects into the model, in comparison to a high-dimensional linear model fit. 

The rest of this paper is organised as follows. In Section [2] we define the l\- 
penalized linear mixed-effects estimator. In Section [3J we present the theoretical 
results for this estimator before describing the details of a computational algorithm 
in Section 01 After some simulations in Section [5] we apply the procedure to a 
real data set. The technical proofs are deferred to an Appendix in the Supporting 
Information. 



2 Linear mixed-effects models and ^-penalized es- 
timation 

2.1 High-dimensional model set-up 

We assume that the observations are inhomogeneous in the sense that they are not 
independent, but grouped. Let i = 1, . . . , N be the grouping index and j = 1, . . . , 
the observation index within a group. Denote by Nt = n i the total number 

of observations. For each group, we observe a m x 1 vector of responses yi, and 
let Xi be a x p fixed-effects design matrix, (3 a p x 1 vector of fixed regression 
coefficients, Zi a x q random-effects design matrix and bi a group-specific vector 
of random regression co efficients. 

Using the notation from iPinheiro and Batesl fl2000h . the model can be written as 



y i = X i p + Z i bi+e i i = l,...,N, (5) 

assuming that 

i) Si ~ Af ni (0, <J 2 I ni ) and uncorrelated for i = 1, . . . , N, 
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ii) bi ~ Af q (0, 9) and uncorrelated for i = 1, . . . , AT, 

m) £i, . . . , £jv, bx, . . . , few are independent. 

Here, 'J' = 'J'e is a general covariance matrix where 9 is an unconstrained set of 
parameters (with dimension q*) such that Vl/e is positive definite (i.e. by using 
the Cholesky decomposition). Possible structures for \& may be a multiple of the 
identity, a diagonal or a general positive definite matrix. We would like to remark 
that assumption i) can be generalized to i') £j ~ N ni (0, a 2 Ai) with Aj = Aj(A) for 
a parameter vector A. This generalization still fits into the theoretical framework 
presented in Section [3] Nonetheless, for the sake of notational simplicity, we restrict 
ourselves to assumption i). 

As indicated by the index i, the bi are different among the groups. All observations 
have the coefficient (3 in common whereas the value of bi depends on the group that 
the observation belongs to. In other words, for each group there are group-specific 
deviations bi from the overall effects (3. We assume throughout the paper that the 
design matrices Xi and Zi are deterministic, i.e. fixed design. 

We allow that the number p of fixed-effects regression coefficients may be much 
larger than the total number of observations, i.e. Nt <C p. Furthermore, the number 
q of random-effects variables might be as large as q < p, but the dimension q* of 
the variance-covariance parameters is assumed to be small (q* <C Nt). We aim at 
estimating the fixed regression parameter vector (3, the random effects bi and the 
variance-covariance parameters 6 and a 2 . Therefore, <j> := (/3 T ,6 T ,a 2 ) T defines 
the complete parameter vector with at most length p + + 1 . From model 

([5]) we deduce that y±, . . . , un are independent and yi ~ N Ui (Xif3, Vi(0, a 2 )) with 
Vi(6, a 2 ) — Z^gZf + a 2 I ni . Denote the stacked vectors y = {yj , . . . , y^) T , b — 
(bf , . . . , bJ f ) T , e = {ej , . . . , eJf) T and the stacked matrices X = (Xf , . . . , Xj 1 ) T , 
Z = diag(Zi, . . . , Zn) and V — diag(Vi, . . . , Vjy). Then model §5§ can be written 
as 

y = X/3 + Zb + e (6) 
and the negative log-likelihood is given by 

- £(4>) = -l{f3, 9, a 2 ) = i{iV T log(27r) + log \V\ + (y- X(5) T V-\y -X0)}, (7) 

where \V\ = dct(V). 

2.2 ^-penalized maximum likelihood estimator 

Due to the possibly large number of covariates (Nt <C p setting) , we cannot use the 
classical maximum likelihood or restricted maximum likelihood approach. Assume 
that the fixed regression coefficients are sparse in the sense that many parameters 
are zero. We then attenuate these difficulties by adding an £i-penalty on the fixed 
regression coefficients. By doing so, we achieve a sparse solution with respect to the 
fixed effects. This leads us to consider the following objective function: 

1 1 p 

Q A (/3,0,a 2 ) := - log | V| + -(y — X(3) T V^ 1 (y — Xf3) + A ^ \/3 k \, (8) 

fc=i 

where A is a nonnegative regularization parameter. Consequently, we estimate the 
fixed regression coefficient vector (3 and the variance components and a 2 by 

% = 09, 8, a 2 ) = argmin Q x (f3, 6, a 2 ). (9) 

/3,0,o- 2 >O,*>O 
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For fixed variance parameters and a 2 , the minimization with respect to (3 is a 
convex optimization problem. Since we want to make use of this convexity (see 
Section 4), we do not pro file the likelihood function , as usually done in the mixed- 
effects model framework (Pinhe iro and Bated . l200dh . However, with respect to the 
full parameter vector </>, we have a non-convex objective function and hence, we 
have to deal with a non-convex problem. This requires a more general framework 
in theory as well as in computation. In the following sections, we discuss how to 
address this issue. 



2.3 Prediction of the random-effects coefficients 

We predict the random-effects coefficients 6j, i — 1, . . . , N by the maximum a pos- 
teriori (MAP) principle. Denoting by / the density of the corresponding Gaussian 
random variable, we define 

hi = argmax/(b,|yi, . . . ,y N , (3,8,<r 2 ) = argmax/(b i |y i , f3, 6, a 2 ) 

bi bi 

/(^|b a ,/3,a 2 )-/(b a |0) 

= arg max 

bi f{yi\(3,e,a 2 ) 



argmin \\\\y t - X,(3 - ZM\\ 2 + bj^ e 1 b l 

bi I v 



From this we get bi = [Zj Zi + cr 2 * g 1 ] -1 Zfri where n = (y, - Xiff) is the 
(marginal) residual vector. Since the true values of /3, 6 and a 2 are unknown, the 
bi's are predicted by b t = [Zj ' Z t + a 2 ^^ 1 ]^ 1 Zffi with fi = {yi — using the 

estimates from 

2.4 Selection of the regularization parameter 

The estimation requires to choose a regularization parameter A. We propose to use 
the Bayesian Information Criterion (BIC) defined by 

BICx := -2e(p, 0, a 2 ) + log N T -df Xl (10) 

where df \ := |{1 < k < p : flk ^ 0}| + dm\(0) is the sum of the number of the 
non-zero fixed regression coefficients and the number of variance-covariance param- 
eters. The use of |{1 < k < p : $k ^ Oil as a measure of the degrees of freedom 



is motivated by the work of IZou et al.l (|2007f) who show that the expected number 



of degrees of freedom for the Lasso in a linear model is given by the number of 
non-zero estimated coefficients. 

Obviously, there are other tuning parameter selection methods, for example cross- 
validation and AlC-type criteria, among others. Advocating the BIC as selection 
criterion is based on our empirical experience that it performs best in both simula- 
tions and real data examples (see Section [5] and [5]) . 

2.5 Adaptive ^-penalized maximum likelihood estimator 

Due to the bias of the Lasso, IZoul ([2006) proposed the adaptive Lasso. For some 
given weights Wi, . . . , w p , the adaptive ^i-penalized maximum likelihood estimator 
has the following objective function instead of ([5]): 

Q^'-'^OS, 9, a 2 ) := I \og\V\ + \{y - Xpfv^iy - X(3) + 

k=l 
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and hence 

j> = (0,6,a 2 ) = argmin Q% L -- w >(p,0,o*). (11) 

/3,0,<r 2 >O,¥>O 

The weights W\ , . . . , w p may be calculated from an initial estimator /3j n jt in © with 
u>k '■= l/|/3ini*.fc(A)| for k = 1, . . . ,p. Unless specified otherwise, we employ these 
weights. 



3 Theoretical Results 

In the high-dimensional setting with p 3> Nt, the theory for penalized estimation 
based on convex lo ss functions with an ^i-penalty is well studied, see for example 



van de Geer ( 20081) . From © and © we see that we are dealing with a non-convex 



loss function, due to the variance parameters 6 and cr , and a convex ^i-penalty 



To the best of our knowledge, onlv lStadler et al.l ([20101 ) consider high-dimensional 



non-convex ^-penalized smooth likelihood pr oblems. In this section, we build upon 
the theory presented in Stadler et all ( 2010l ) and extend their results to prove an 



oracle inequality for the adaptive ^-penalized estimator (ITT1) . 
We use the following framework and notation. Let i = 1,...,N as before and 
Ui = n > 1 the same for all i. Denote by yi s y C K™ the response variable. Let 
Xi be the fixed covariates in some space X n C R nxp and Zi C X{. The latter 
can be assumed without loss of generality, since we can assign to every variable a 
fixed effect being equal to zero. Define the parameter </> T := (/3 T , T , 2 logo - ) = 
(/3 T , T , q) — (/3 T , J7 T ) g and denote by </>o the true parameter vector. For 

a constant < K < oo, consider the parameter space 

* = {0 T = 03 T ,T7 T ) : sup \x T (3\ < K, < K, * > 0} e R p+9 * +1 , (12) 

where ||tj||oo = max; \rji\. We modify the estimators in © and (fTTj) by restricting 
the solution to lie in the parameter space <fr: 

4> := arg min | i log | V| + i(» - X(3) T V' 1 (y - X(3) + A V |/3 fc | 1 , (13) 

:= arg min J ~ log | V| + ~ (» - X/3) T V" 1 (y - X/3) + A V w k \/3 k | 1. (14) 
I 2 2 *=1 J 

Now, let fcj>.Xi,Zi be the Gaussian density for with respect to the above parametriza- 
tion. Since we use the negative log-likelihood as loss function, the excess risk coin- 
cides with the Kullback-Leibler distance: 

£x,z{<t>\<t>°) = / l°g {^jj Uo,x,zdfi, (15) 

where /i denotes the Lebesgue measure, and we define the average excess risk as 

1 N 

£x u ...,x N ,z u .,.,z N (</>\<t>o) = J^22£Xi,Zi{^\4>0)- 

i=l 

In the sequel, we drop the indices x,z and x 1 ,...,x N ,z 1 ,...,z N , respectively. 
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3.1 Consistency for the ^-penalized estimator 

We require only one condition for consistency. It is a condition on the random-effects 
design matrices Zi. 



Assumption 1 The eigenvalues of Zj Zi, denoted by f or i = 1, ■ ■ ■ , N , 

are bounded: Uj < K < oo for all i and j , with K from Uty) . 
Now we consider a triangular scheme of observations from (|5|) : 

y, = X,f3 N + Z,b t +£i i = l,...,N, (16) 

where the parameters (3m and tjjv are allowed to depend on N. We study consistency 
as N —> oo but the group size n is fixed. Moreover, let us use the notation a V b :— 
max{o, b}. 



Theorem 1 (Consistency) 

Consider model \lb}) and the estimator \1S\) . Under Assumption^ and assuming 



" ■ " \y \og 4 (N)\og(pVN))' V N 

for some C > 0, any global minimizer (p as in U3\) satisfies £(<fi\<po) = op(l) as 

N ->• oo. 

A proof is given in the Appendix in the Supporting Information. The condition 
on ||/3o,jv||i is a sparsity condition on the true underlying fixed-effects coefficients. 



3.2 Oracle inequality for the adaptive ^-penalized estimator 

We now present an oracle optimality result in non-asymptotic form for the adap- 
tive estimator (and thereby covering also the non- adaptive case). Preliminary, we 
introduce some notation and two further assumptions. 



Assumption 2 

(a) Let (u^Yj—i be the eigenvalues of Z^Zj for i — 1,...,N. At least two 

eigenvalues are different, i.e. for all i 3ji ^ j% G {1, . . . ,n} such that wj* ^ 
(») 

(b) For i = 1, . . . , N , the matrices £li defined by 

(n,) riJ = tr (vr^^V^-^p-) r, s = 1, . . . ,q* + 1 

V d(j) p+r d(t>p + s J 

are strictly positive definite. 

Remark. In the special case = 9 2 I, Assumption [2] (b) automatically holds. 
Let S(f3) — {1 < k < p : (3k ^ 0} be the active set of f3, i.e. the set of non-zero 
coefficients, and (3/c = {C3k ■ k G fC} for fC C {1, . . . ,p}. We denote by So = S((3q) 
the true active set and by sq = |So| its cardinality. Write Xf — [x\, . . . ,x l n ) and 
define 

N n 
»=1 3=1 
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Assumption 3 (Restricted Eigenvalue Condition) 

There exists a constant k > 1, such that for all (3 € W satisfying \\/3s°\\i < 6||/3s ||i 
it holds that ||/3<j ||| < n 2 (3 T T, N . n [3. 

A discussion of this assumption can be found in Bickel et al. ( 20091 ) and van de Geer and Buhlmaiml 
(j2009t ). Define 



A = M N log N 



log(p V A) 

TV : 



(17) 



where Mjv is of order log N and an exact definition is given in the proof of Theorem 
[TJ For any T > a\, let J be a set defined by the underlying empirical process (see 
(A. 6) in the Supporting Information). It is shown in the proof of Theorem [T] that 
the set J has large probability, 



¥[J] >l-a 2 exp 



T 2 log 2 A^ log(p V N) 



P 



log AT A 1 - 25 



for N sufficiently large and some constants ai, a2, 03, £, p > 0, see Lemma 2 and 3 
in the Appendix A (in the Supporting Information). 



At this point, we could conclude an oracle result in the way of IStadler et al 



(2010). However, we extend that result and present an oracle inequality involving 
||/3 — (3o\\ i instead of ||/3sg||i for the ^-penalized as well as the adaptive ^-penalized 
estimator. 



Theorem 2 (Oracle result) 

Consider the weighted i\-penalized estimator Jj^[). Suppose that for some 5 > 0, 



w k 



< 1/(5 fee So, 
>1/S k^S Q . 



Under Assumptions^ [H and\^ and for A > 2T<5Ao, we have on the set J defined 
in (A. 6), 

£{<t>weight\4>o) + 2(A/<5 - TXo)\\$ wei ght - A) Hi < 9(A/(5 + rA ) 2 CgK 2 so. 

The proof is given in the Appendix B. Application of Theorem [5] with all weights 
equal to one (5 = 1) gives an oracle result for the ^-penalized estimator, which we 
will use as initial values for the adaptive Lasso procedure. 



Corollary 1 Let 

4>init ■= ($init,Oinit,Qinit) ■= arg min Qx'"l t {(3, 8, g) , 



be the initial estimator in fi^| ) (i.e., the estimator with all the weights equal to one). 
Under Assumptions^ [H and[3 and for \i n u > 2TAo, we have on J , 

£{4>init\4>o) + 2(A mit - T\ o )\\0 mit - /3o ||i < 9(A mit + TA ) 2 CoK 2 s - (18) 

It is clear that the ^-estimation error bound implies a bound for the £oo estima- 
tion error as well. When the underlying true coefficients /3o,fe, k £ So are sufficiently 
much larger in absolute value than the ioo- estimation error bound, one can perfectly 
distinguish between active and non-active set. This argument is applied in the next 
corollary to the adaptive Lasso with estimated weights. 
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Corollary 2 Let 



4>ada P ■= {$ada P :d adap ,Q adap ) := arg min Q™ 1 ' " ,u,p (f3, 6, g) , 

</>s# adap 

be the adaptive estimator with weights Wk = l/\f3init,k\> k = 1, . . . ,p as in H3\) . 
Assume that for all k € Sq, 

\/3 ,k\> 2S init, (19) 

where 

_ 9(X mlt +TX ) 2 c 2 K 2 s > 

°imi •— ^7T ^ , s £ HPinit — Po||l 

£\Knit — J AO) 

is a bound of the t\-estimation error of the initial (3i n it- Suppose moreover that 
Assumptions^ [H and\3\ are met. Then, for \ a dap > ^T^initXo, and on the set J , 

£ (d) a dap |0O ) + 2(A a dap/<5mit — 7 1 Ao)||/3 a dap — A)||l 
< 9{^adap/Sinit + T\ ) 2 <?qK 2 Sq. (20) 



We call condition (1191) a "betamin" condition. It is clearly very restrictive, but 
allows for an easy derivation of th e oracle result. The "be tamin" condition can 
indeed be substantially refined. In Ivan de Geer et al. (2010), one can find similar 



oracle results, and in addition variable selection results, for the adaptive Lasso in 
the linear model, without "betamin" conditions. These results require introducing 
various versions of restricted eigenvalues and sparse eigenvalues, and can be gener- 
alized to the current setting. Since a full presentation is rather involved, we have 
confined ourselves to the simplest case. 

Recall that in (|17p . we choose Ao of order log 2 Ny/log(p V N)/N. When we 
also choose Xinit of this order, we find, modulo the restricted eigenvalue k and the 



constants T and cq, that the right-hand side of the oracle result (|18[) for the initial 
estimator is of order 

and that 



<5init X log iVy — S - 

The tuning parameter for the adaptive Lasso can then be taken of order 

4 logfrVJV) 

Xadap ~ fog N — S . 

The right-hand side f|20[) of the oracle result for the adaptive estimator is then of 
the same order as the one for the initial estimator. 

Assuming "betamin" conditions, the results in Corollary [T] and [2] imply the 
variable screening property motivated already in Q . 

Corollary 3 

1) For the i\-penalized (initial) estimator U3\) , assume 

mm|/3 , fe | > 6^ = — — — . 

k <£(Ainit — J AO) 

Then, under the assumptions of Corollary Ql on the set J , 
Sq C Sinit = {1 < k < p : Pinit,k ^ 0}. 
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2) For the adaptive ^-penalized estimator in Corollary^ assume 

. , | 9(Kdap/Sinit + TXq) 2 cIk 2 Sq 
k £\Aadap/ Oadap ~ 1 

Then, under the assumptions of Corollary^ on the set J , 

So C Sadap = {1 < k < p : fladapM ^ 0} 

The proof of Corollary [3] is given in the Appendix B. 

4 Computational algorithm 

The algorithm for the estimators in (|9]) a nd (flTT) are based on th e Block Coordinate 



Gradient Descent (BCGD) method from lTseng and Yunl (|2009l ) 



The main ideas of our BCGD algorithm are that we cycle through the coordinates 
and minimize the objective function Q\{.) with respect to only one coordinate 
while keeping the other parameters fixed (i.e. a Gauss-Seidel algorithm). In each 
such step, we approximate Q\(.) by a strictly convex quadratic function. Then, 
we calculate a descent direction and we employ an inexact line search to ensure a 
decrease in the objective function. 

B CGD algorithms are u sed i n iMeier et al. LJ200 8|) fo r the group Lasso as well as 
in Wu and Lange (|200§ ) and [Friedman et al.l (|2010[ ) for the ordinary Lasso. We 



remark that IMeier et al.1 (|2008f) have a block structure due to the grouped variables 
whereas we only focus on ungrouped covariates. Thus the word "block" has no 
meaning in our context and consequently, we omit it in the subsequent discussion. 
Furthermore, the ordinary Lasso has only regression parameters to cycle through 
in contrast to our problem involving two kinds of parameters: fixed regression and 
variance-covariance parameters. 

Let us first introduce the notation and give an overview of the algorithm before 
proving that our optimization problem achieves numerical convergence. All the 
details as well as some computational aspects are deferred to the Appendix C in 
the Supporting Information. 

Let 4> T = (f3 T ,rj T ) e M. p+q +1 be the parametrization introduced in the previous 
section. Define the functions 

P(0):=f> fe | , g{<t>) := \\og\V + \(y - X (1) T V (inY 1 {y-X (5). 

k=l 

Now © can be written as <p\ = argmin^ Q\(4>) := g(<fi) + XP((f>). Letting Bj the 
jth unit vector, the algorithm can be summarized in the following way: 

Algorithm 1 (Coordinate Gradient Descent) 

(0) Let 0° E M p+9 * +1 be an initial value. 

For I = 0, 1,2, . . let S £ be the index cycling through the coordinates {1}, {2},. . . , 

{p + q*}, {p + ?* + !} 

(1) Approximate the second derivative d ^ ^ 2 Q\((p e ) by h l > 0. 

(2) Calculate the descent direction 

S := argmin d6R \g(tf) + ^g(4> e )d + l/2d 2 // + XP{4>" + de s e)}. 
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(3) Choose a stepsize a e > and set (p l+1 = <p e + a^d^e s e such that there is a 
decrease in the objective function. 

until convergence. 

The details of (0) - (3) and further computational issues are given in the Ap- 
pendix C of the Supporting Information. An implementation of the algorithm can 
be found in the R package lmmlasso, which is available from the first author's web- 
site (http://stat.ethz.ch/people/schell) and will be made available on CRAN. 

The convergence properties of the CGD algorithm are described in the following 
theorem. 

Theorem 3 (Convergence of the CGD algorithm) 

If ((/> )i>o is chosen according to Algorithm^ then every cluster point of {4> l }t>o 
is a stationary point of Q\ (</>) . 

The proof is given in the Appendix C. 

In general, due to the non-convexity of the optimization problem, the CGD al- 
gorithm may not achieve the global optimum. 



5 Simulation study 



In this section, we assess the empirical performance of the ^-penalized maximum 
likelihood estimators © and (fTTj) in different kinds of simulation examples. We 
study several performance measures and compare the proposed method with Lasso 
and linear mixed-effects methods, if possible. 

After some introductory remarks, we focus on high-dimensional examples. The 
simulation study for the low-dimensional setting is provided in the Supporting In- 
formation. The application of the new procedure on a real data set is illustrated in 
the next section. 

Hereafter, we denote by ImmLasso the ^-penalized maximum likelihood esti- 
mator ([9]), by ImmadLasso the adaptive ^-penalized maximum likelihood estimator 
(TTT|) and by Ime the classical lin ear mixed-effects model provided by the R package 
nlme (Pinhciro and Batestl2000| ). Furthermore, let L asso denote the standard Lasso 



(jEfron et all l2004h and adLasso the adaptive Lasso (jZoul . [2006) where the regular- 
ization parameter is chosen by minimizing the Bayesian Information Criterion. 

As an overview, let us summarize the most important conclusions from the 
simulation studies: 

(a) The variability of the estimated fixed-effects parameters 0k is much smaller 
if there is no corresponding random effect (bi)k for i — 1, . . . , N, for all Ime, 
ImmLasso and ImmadLasso. 

(b) In the high-dimensional framework, the following aspects appear (and are 
virtually not observable in the low-dimensional setting): 

1. Penalizing fixed-effects covariates which also incorporate a random effect 
causes bias problems. To be more specific, let us assume that the pe- 
nalized kth covariate has a fixed and a random-effects coefficient, i.e. fik 
and (pi)k, respectively. If the regularization parameter A is large and /3fc 
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subject to penalization, then /3& is shrunken towards zero. Thereby, the 
estimate of the corresponding variance parameter gets large and ibijk 
has a bias related to the amount of shrinkage in f3k- As a consequence, 
covariates with fixed and random effect should no be subject to penal- 
ization. 

An adaptive procedure (jllj) with appropriate weights may reduce this 
adverse effect, but it does not overcome t he aforementioned problem com- 
pletely. The work of lBondell et al. ( 201dh covers only the low-dimensional 



case and the authors do not present any parameter estimates in the sim- 
ulation study. 

(c) There is a remarkable reduction of the estimated error variance a 2 when in- 
corporating the random-effects structure in ImmLasso, ImmadLasso and Ime 
compared with Lasso and adLasso. 

(d) The variability of the Lasso and adLasso coefficient estimators are larger than 
the corresponding variability of the mixed-effects model approaches. 

(e) If we focus on the identification of random-effects covariates, we suggest using 
a diagonal structure for ^ and then eliminating those random-effects covari- 
ates with a small variance. An elaborate discussion of the selection of the 
random-effects structure is beyond the scope of this paper. In Section |B] we 
suggest a strategy how to remedy this problem. 

In all subsequent simulation schemes, we restrict ourselves to the case where all 
groups have the same number of observations, i.e. we set n% = n for i = 1, . . . , N. 
Let the first column of Xj be the (non-penalized) intercept. We assign Zi C Xi such 
that the columns of Zi correspond to the first q columns of Xi. This means that 
the first q variables have both a fixed-effects coefficient f3k and a random-effects 
coefficient (bi)k for i = 1, . . . , N and k — 1, . . . , q. The covariates are generated 
from a multivariate normal distribution with mean zero and covariance matrix X 
with the pairwise correlation = p\ k ~ k I and p — 0.2. Denote by (3q the true 
fixed effects and by sq := #{1 < k < p : /?o,fc =/= 0} the true number of non-zero 
coefficients. Unless otherwise stated, we set VP = 9 2 I. In all subsequent tables, a 
non-penalized fixed-effects coefficient is marked by an asterisk * . 



5.1 High-dimensional setting 

We study four examples in the high-dimensional setting (/?o,i = 1 is the unpenalized 
intercept). 

Hf. N = 25, n = 6, N T = 150, p = 300, q = 2, a 2 = 0.25, 6 2 = 0.56 and s = 5 
with p = (1,2,4,3,3,0,...,0) T . 

H 2 : N = 30, n = 6, N T = 180, p = 500, q = 1, a 2 = 0.25, 6 2 = 0.56 and s = 5 
with A, = (1,2,4,3,3,0,...,0) T . 

H 3 : N = 30, n = 6, N T = 180, p = 1000, q = 3, a 2 = 0.25, <9 2 = 0.56 and s = 5 
with fa = (1,2,4,3,3,0,...,0) T . 

H 4 : N = 25, n = 6, N T = 150, p = 300, o 2 = 0.25, 

/3 0\ 
* = 3 
\0 2/ 

and So = 5 with f3 = (1, 2, 4, 3, 3, 0, . . . , 0) T . In contrast to the previous 
examples, we fit a wrong model assuming that ^ is diagonal with dimension 
4. 
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The results in the form of means and standard deviations (in parentheses) over 
100 simulation runs are depicted in Table [5J [3] and 2J Therein, denotes the 

cardinality of the estimated active set and TP is the number of true positives. 



Table 2: Simulation results for Hi, H2 and H3 



Model 



Method 



\S(f3)\ 



TP 







02 



Ps 



Pa 



true 



0.25 



0.56 



1.01* 
0.16) 
L.01* 
0.16) 
L.01* 
0.17) 
1.01* 
0.17) 



2.05* 
(0.16) 
2.03* 
(0.16) 
2.07* 
(0.19) 
2.02* 
(0-18) 



ImmLasso 
ImmadLasso 
Lasso 
adLasso 



6.70 
(2.14) 

6.59 
(2.02) 

6.29 
(1.46) 

6.29 
(1-46) 



5 

(0) 
5 

(0) 
5 

(0) 
5 

(0) 



0.29 
(0.05) 

0.22 
(0.04) 

1.36 
(0.27) 

1.16 
(0.24) 



0.52 
(0.12) 

0.52 
(0.12) 



1.00* 
0.15) 
1.00* 
0.15) 
L.00* 
0.15) 
L.00* 
0.15) 



3.86 
(0.06) 

3.98 
(0.06) 

3.76 
(0.10) 

3.98 
(0-10) 



2.90 
(0.06) 

2.99 
(0.05) 

2.84 
(0.11) 

3.00 
(0-11) 



2.88 
(0.06) 

3.00 
(0.05) 

2.79 
(0.10) 

2.99 
(0-10) 



Ho 



ImmLasso 
ImmadLasso 
Lasso 
adLasso 



6.65 
(1.71) 

6.53 
(1.64) 

6.84 
(2.02) 

6.84 
(2.02) 



5 

(0) 
5 

(0) 
5 

(0) 
5 

(0) 



0.28 
(0.04) 

0.22 
(0.03) 

0.87 
(0.19) 

0.72 
(0-17) 



0.56 
(0.17) 

0.55 
(0.17) 



1.02* 
0.15) 
1.02* 
0.15) 
1.03* 
0.17) 
1.03* 
0.16) 



1.90 
(0.04) 

2.00 
(0.04) 

1.84 
(0.08) 

2.00 
(0.07) 



2.00* 
(0.15) 
2.00* 
(0.15) 
2.02* 
(0.18) 
2.02* 
(0-17) 



3.91 
(0.05) 

3.99 
(0.04) 

3.88 
(0.07) 

4.00 
(0.07) 



4.04* 
(0.15) 
4.00* 
(0.15) 
4.06* 
(0.19) 
3.99* 
(0-18) 



2.92 
(0.04) 

3.00 
(0.04) 

2.88 
(0.09) 

3.00 
(0.08) 



2.89 
(0.05) 

2.99 
(0.04) 

2.83 
(0.08) 

2.98 
(0.08) 



ImmLasso 
ImmadLasso 
Lasso 
adLasso 



6.17 
(1.74) 

6.12 
(1.70) 

5.93 
(1.48) 

5.93 
(1-48) 



5 

(0) 
5 

(0) 
5 

(0) 
5 

(0) 



0.29 
(0.05) 

0.23 
(0.04) 

1.94 
(0.36) 

1.69 
(0.32) 



0.52 
(0.10) 

0.53 
(0.10) 



2.84 
(0.07) 

2.99 
(0.07) 

2.70 
(0.11) 

2.98 
(0-12) 



2.84 
(0.06) 

2.99 
(0.06) 

2.70 
(0.13) 

2.97 
(0-12) 



indicates that the corresponding fixed-effects coefficient is not subject to penalization 



Table 3: Simulation results for H4 



Method 


\sm 


TP 




Pi 


P2 


P 3 


Pa 


P 5 


true 


5 


5 


0.25 


1 


2 


4 


3 


3 


ImmLasso 


5.56 


5 


0.26 


0.95* 


1.99* 


3.97* 


3.04* 


2.82 




(0.97) 


(0) 


(0.05) 


(0.31) 


(0.38) 


(0.31) 


(0.07) 


(0.07) 


ImmadLasso 


5.56 


5 


0.22 


0.95* 


1.99* 


3.97* 


3.00* 


3.00 




(0.97) 


(0) 


(0.04) 


(0.31) 


(0.38) 


(0.30) 


(0.07) 


(0.06) 


Lasso 


6.84 


5 


7.85 


0.94* 


2.01* 


3.99* 


3.11* 


2.36 




(12.18) 


(0) 


(1.81) 


(0.38) 


(0.47) 


(0.38) 


(0.23) 


(0.28) 


adLasso 


6.79 


5 


7.25 


0.95* 


2.02* 


4* 


2.98* 


3.01 




(11.68) 


(0) 


(1.76) 


(0.37) 


(0.47) 


(0.38) 


(0.22) 


(0.29) 



* indicates that the corresponding fixcd-cffccts coefficient is not subject to penalization 



Table 4: Mean covariance estimates for H4 



Method 




*22 


*33 


*44 


true 


3 


3 


2 





ImmLasso 


2.82 


2.94 


1.85 


0.01 




(0.80) 


(0.88) 


(0.62) 


(0.02) 


ImmadLasso 


2.81 


2.94 


1.84 


0.01 




(0.79) 


(0.88) 


(0.62) 


(0.02) 



Let us sum up the simulation results for the models H1-H4. As in the low- 
dimensional setting (see Appendix D), the estimated active set is sparse and all 
methods include the true non-zero coefficients. 

Table [5] reveals that ImmLasso and ImmadLasso reduce the error variance remark- 
ably in comparison with Lasso and adLasso. Nevertheless, ImmLasso overestimates 
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the true value of a 2 whereas ImmadLasso underestimates a 2 . We observe, in partic- 
ular for H4, that a maximum likelihood approach (in contrast to a restricted maxi- 
mum likelihood approach) gives biased variance-covari ance estimators. It is possible 
to implement a REML-type approach ( Ni et all l2010f) in the high-dimensional set- 
ting in order to reduce the bias in the variance parameters. However, we have 
observed that i) the number of (Gauss-Seidel) cycles increases and ii) the algo- 
rithm may fail to converge. 

In all models, we do not penalize the covariates with both a fixed and random effect. 
Without doing this (not shown here) , the fixed effects would be set to zero whereas 
the estimated between-subject variability 9 2 would increase. As a consequence, the 
predicted random effects are too large and are not centered at zero, but around 
the true fixed effect. Hence this would result in a model which docs not fulfill the 
assumptions in ([5]) anymore. 

Table [2] and [3] reveal that the variability of the fixed effects with no corresponding 
random effect is approximately half of the non-penalized coefficients. This differ- 
ence of estimation variability is also observed in the classical linear mixed-effects 
framework (see Ime in Table 7 and 8 in the Appendix D). Besides, ImmLasso has a 
bias towards zero, which is notably smaller than that from the Lasso. As expected, 
this bias can be reduced by ImmadLasso. 

Concerning H4, it is worth to point out that although not knowing the true covari- 
ance structure, we may use a diagonal structure for St and then drop the variances 
which are close to zero. A suggestion how to use this idea in a real data set is 
presented in the next section. 



5.2 Within-group prediction performance 

We now turn to consider the performance of the proposed methodology concern- 
ing within-group prediction. We compare the predictive performance between six 
different Lasso procedures. In doing so, denote by ImmLasso, ImmadLasso, Lasso 
and adLasso the procedures from the previous subsection. In addition, let cv-Lasso 
be a cross-validated Lasso and cv-adLasso a cross-validated adaptive Lasso whose 
A-value is chosen by 10-fold cross-validation. 

We fix the following scenario: N — 25, rii = 6 for i = 1, . . . , N, q = 3, sq = 5 with 
/3o = (1,1.5, 1.2,1, 2, 0,...,0) T , o~ 2 = 1 and p = 0.2. We only alter the number 
of fixed covariates p and the variance component 6 2 . For measuring the quality of 
prediction, we generate a test set with 50 observations per group and calculate the 
mean squared prediction error. The three models considered are 

Pi: p = 10, P 2 : p = 100 and P 3 : p = 500. 

The results are shown in Table [SJ 

We see that the methods differ slightly for 9 2 — which corresponds to no 
grouping structure. As 8 2 increases, the mean squared prediction error increases 
less for the ImmLasso and the ImmadLasso than for the other methods. These 
results highlight that we can indeed achieve prediction improvements using the 
suggested mixed-effects model approach if the underlying model is given by ([5]). 



6 Application: riboflavin data 

Data description. We illustrate the proposed procedure on a real data set which 
is provided by DSM (Switzerland). The response variable is the logarithm of the 
riboflavin production rate of Bacillus subtilis. There are p — 4088 covariates mea- 
suring the gene expression levels. We have N — 28 groups with m € {2, . . . , 6} 
and Nt = 111 observations. We standardize all covariates to have mean zero and 
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Table 5: Mean squared prediction error for three simulation examples. 



Modol 


e 2 


ImmLasso 


lmmadLasso 


Lasso 


adLasso 




cv-adLasso 


Pi 





1.01 


1.02 


1.00 


1.01 


1.05 


1.01 


(p = 10) 


0.25 


1.33 


1.29 


1.76 


1.84 


1.81 


1.84 




1 


1.66 


1.55 


3.74 


3.74 


3.88 


3.77 




2 


1.67 


1.80 


5.92 


6.25 


5.94 


6.25 


P-2 





1.12 


1.02 


1.26 


1.09 


1.20 


1.14 


(p = 100) 


0.25 


1.51 


1.38 


1.75 


1.75 


2.06 


1.75 




1 


1.94 


1.86 


4.35 


4.53 


4.61 


4.23 




2 


2.49 


1.95 


7.04 


7.02 


7.09 


6.98 


p 3 





1.22 


1.07 


1.18 


1.26 


1.24 


1.58 


(p = 500) 


0.25 


1.83 


1.58 


2.63 


2.67 


2.98 


3.58 




1 


2.00 


1.85 


4.35 


3.78 


4.14 


4.85 




2 


2.54 


2.04 


10.30 


8.26 


9.47 


11.28 



variance one. 



Model selection strategy. Preliminary, we address the issue of determining those 
covariates which have both a fixed and a random-effects coefficient. In other words, 
we specify the matrices Zi C Xj. Since we have to deal with high-dimensio nal, 
low sample size data, the various tools proposed in iPinheiro and Bates (2000) for 



determining Zi can hardly be applied. Instead, we suggest the following strategy: 

(1) Calculate an ordinary Lasso solution Lasso (with cross-validation) and define 
the active set S init := {1 < k < p : (3% asso ^ 0}. 



(2) For each I € SWiit, fit a model in which only the Zth covariate has a random- 
effects coefficient. Denote the corresponding variance estimate by Of. 

(3) Let > 0%, > . . . > be the ordered estimated variances from (2). 
Then for k > define the set K K := {I G S lnlt : 9? > n}n{l £ S lnlt : BIC §2 < 
BICq} where BICq is the BIC of the Lasso solution in (1). 

(4) Fit a model with Zi = Xj 1 '* (where consists of the variables included 
in 1Z K ) and \& being diagonal and keep the non-zero elements of \&. 

By doing so (and setting k = 0.05), it seems reasonable to fit a model wherein two 
covariates have an additional random effect. Denoting these variables as k\ and ^2, 
the model can be written as 

Uij = xJ j j3 + bik 1 z i jk 1 +bik 2 Zijk 2 +eij i = l,...,N, j = l,...,m (21) 

with b lkl ~N(0,9 2 ki ), b ik2 ~N(0,6 2 k2 ) and e tf ~ ^(0, a 2 ). 

Results. We compare the results of ImmLasso and lmmadLasso with Lasso and 
adLasso. The variance component estimates, the cardinality of the active set and 
the rank R of five fixed-effects coefficients are shown in Table [5] The ranking is 
determined by ordering the absolute values of the fixed-effects coefficients. 

From Table [SI we see that the error variance of the Lasso may be considerably 
reduced by the ImmLasso. Although the variance 6\ is small, the BIC of this model 
is smaller than that of the model including only ki as random covariate. It is note- 
worthy that 53% of the total variability in the data set is due to the between-group 
variability. This strongly indicates that there is indeed some variability between 
the groups. As might have been expected from the simulation results, the active 
set of ImmLasso and lmmadLasso is smaller than the active set from Lasso and 
adLasso. The ranking indicates that there is one dominating covariate whereas the 
other coefficients differ only slightly between the four procedures (not shown). 
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Table 6: Results for ImmLasso, ImmadLasso, Lasso and adLasso of the riboflavin 
data set 





ImmLasso 


ImmadLasso 


Lasso 


adLasso 




0.18 


0.15 


0.30 


0.20 




0.17 


0.08 








0.03 


0.03 






\S{(3)\ 


18 


14 


21 


20 


R h 


1 


1 


1 


1 


R h 


2 


2 


•1 


6 


R h 


3 


3 


3 


5 




4 


13 






R h 


5 


6 


6 


7 



7 Discussion 

We present an .^-penalized maximum likelihood estimator for high-dimensional lin- 
ear mixed-effects models. The proposed methodology copes with the difficulty of 
combining a non-convex loss function and an ^i-penalty. Thereby, we deal with the- 
oretical and computational aspects which are substantially more challenging than 
in the linear regression setting. We prove theoretical results concerning the con- 
sistency of the estimator and we present a non-asymptotic oracle result for the 
adaptive ^-penalized estimator. Moreover, by developing a coordinate gradient de- 
scent algorithm, we achieve provable numerical convergence of our algorithm to at 
least a stationary point. Our simulation studies and real data example show that 
the error variance can be remarkably reduced when incorporating the knowledge 
about the cluster structure among observations. 

Supporting Information 

Additional Supporting Information can be found in the Appendices of this article: 

Appendix A. Proof of Theorem 1 from Section [3] 

Appendix B. Proof of Theorem 2 from Section [3] 

Appendix C. Computational details of Algorithm 1 in Section 2J 

Appendix D. Simulations for the low-dimensional setting in Section [SJ 
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Appendix A: Proof of Theorem 1 

The proof consists of three parts. Firstly, we need a n ineq uality ensuring that 
Lemma [2] holds. Secondly, we show that the probabili ty (|A.2[) in Lemma 121 is large. 
And for completion of our proof, we can then refer to Sta dler et al. ( 2010l ). 
From model (5), the log- likelihood function of yt with respect to the parametrization 
in (12) is given by 

MVi) : = ~\ lo s( 27r ) - \ lo S \ z i*eZi + e'I\ - - XipfiZi^eZ? + eBI)-\ yi - X t (3) 
Then, define the score function s^(yi) := d/d<p£^,(yi). 



Lemma 1 Under Assumption 1, there exist constants c\, 02,03 € 



sup || s^y, 
0e# 



< Gi{vi) := ci + c 2 1 1 2/i 1 1 2 + C3 1 1 2/i 1 1 2 * = 1, 



such that 
,.,N. 



Proof. The proof is straightforward using on the one hand the Cauchy-Schwarz 
inequality and the fact that the induced L2-norm of a square matrix A is given by 
|| -<4|| 2 = y/\ ma x{A T A), where X ma x denotes the largest eigenvalue. On the other 
hand, we conclude from Assumption 1 and (12) that the eigenvalues of Z^qZ t 
and Zd/de^eZ? arc bounded. □ 

Now we introduce the empirical process and present a result which controls the 
increments of it. The Lemma below gives a lower bound for the probability that 
the increments are small. Afterwards, we show that this lower bound is large. 
Define the empirical process 



v N {<t>) 



i N 



and 



A = -M/v log N 



log(p V N) 



N 



(A.l) 



Lemma 2 Under Assumption 1 and for constants a\, 02 and depending on K 
and for all T > a\, 



sup 



v N (4>) - v N (<p ) 



(||/3-A)||i + ||»7-»7o||2) VA 
with probability at least 



<TX 



1 — 02 exp 
where d := n + q' 



T 2 log 2 N log(p V N) 



N 



l ±F( yi )> TX2 ° 



N 



dK 



1 and 



F{yi) = G 1 (y i )l {Gliyi)>MN} +E G 1 (y i )l {Gl{yt)>MN } 



(A.2) 



(A.3) 
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The proof of Lemma [2] is given in 
third term is small in our setting. 



Stadler et al, 



(|20ld) . Next, we show that the 



Lemma 3 There are constants b\ and 62 depending on K and n, a constant p 
depending on T , n and K such that for any < e < 1/2 and Mjv := 6i(2ylogiV + 
\fb~2) 1 we have 

/AT ~ \ 

1 



< 



_P 



Proof. In the subsequent discussion, if A is a constant, we assume throughout that 
N is large enough such that Mm — A > 0. From (|A.1|) we see that it suffices to 
show that for a constant 04, 



\ i=i 



> 



logiV 



iV 



< 



1 



log TV AT 



l-2e ' 



(A.4) 



The expectation in (|A.3|) only affects the constants in the remainder of the proof. 
Therefore, we omit this term in the sequel. From 



'[ci + c2||j/i||2 + c 3 |M|l > Mjv] < P \\\y z \\1 > 



A/jV - C!\2 
2^T 



2 A/jv - ci - 
IWilla > 



2c 3 



and the fact that Mat — > 00 , we deduce that we can restrict ourselves to the analysis 
of Pflll/iHl > Mjf]. For the sake of notational simplicity, we will leave out the index 
i and show that for an appropriate definition of Mjy, 



Ivlli > M N ] < j^. 



(A.5) 



Denote by xt(^) the noncentral x 2 distribution with v degrees of free dom and 
non-centrality parameter 8. The following identity holds (|Liu et all |2008[ ). 

Claim 1 If y ~ Af n (fJ,,V) with (j, G K" and V G M" x " positive definite, then 
II2/II2 = y T 2/ = S"=i ^iXi(^j) w/lere {X?(^)}"=i are independent, Xj for j = 
1, . . . ,n are i/ie eigenvalues of V and if V = UDU T for an orthonormal matrix 
U, then 8 j =(U T V- 1/2 n) 2 j . 



Claim 2 



\ X \{8) > M] < 



exp 



(VM-V8f 



Proof. If X ~ Af(n, C 2 ), then by definition of the noncentral x 2 distribution (X/Q 2 ~ 
Xl=i{5 = OVC) 2 )- Hcn c e HxiOO > M] = 2 • P[f > n/M] = 2 • > 
VM-VS] = 2 • S(VM-VS), where := / t °° exp(-u 2 /2)du is the survival 
function of a standard Gaussian random variable for which the following inequalities 
hold: 

' exp(-i 2 /2) < S(t) < i— Lexp(-i 2 /2) for t > 0. 



1 + t 2 
Thus, we conclude 



* V2tt 



»[X?(*) > M] < 



M — y/S \[2t{ 



exp 



□ 
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Claim 3 For M N j := (2y/\ogN + y/S) 2 , 



P[ x l(6) > M N , S ] < i 



Proof. Using Claim [21 



> M.A < - 7g ^-.^ t , P (- 'v^-^ ) 

(2Vl^N + VS - VS)\ 1 
< 1 • exp( ) < — . 



□ 



Claim 4 For the eigenvalue vector A := (Ai, . . . , A„) of V , the non-centrality pa- 
rameter vector S := (S\, . . . , S n ), define 

^max ■— max A ? - 

l<j<n J 



M Ntnt \ t5j := n\ max (2 v / logN + yfd^f 
(5 := argmaxP[xf(^) > r^^] 

5j ^^max 



Proof. For any M > 0, using Claim [T] and [2 

n n ,j n ,j 

P[Wl>M]=p[^A 3 - X ffe)>M] <^p[ x f(5 j )>— ] <£p[x?(*i)> 



3=1 3=1 J 3=1 

< n ■ max P[xi(<5j) > — — — 1 ■ 



l<j<n L n\max 

Set M = Mn,u,x,s and using Claim [3] 

P[||y||| > M NM ] <n-¥[xl(S) > (2^N + VS) 2 } < jL. 



□ 



At this point, we have proven (IA.5I) . Due to Assumption 1 (and by using the 
same techniques as in the proof of Lemma [T]) , Xmax < \<f{<l + 1)K 3 := b\ and 
of* < nK 2 e K := &2 for all i and j. Thereby, we define 



3 

\2 



M N := fei(2 v / iogiV + a/62) S 
Hence we choose Mjv of the order log N. We now use these results to derive formula 
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1 N 



Gi(yi)>M N } > "4" 



l ogiV 



1 N 



and using Markov's inequality gives 



{ci+c 2 ||yi||2+c 3 | >M N } 



> Gt4 



logiV 



JV 



1 1 

0,4 logJV 



2\ 

ci^p[ci+c 2 ||yi]| 2 + C3||w|]l > Mat] 



»£e[ii» 

i=l 
JV 

'X) E [llwllal { c 1 +c a ||» < 



.] 



ll2+c 3 ||l/ill|>Mj V } 



+ C 2 Z^^[\\y^\^ 1 -{c 1 +c 2 \\y I \\ 2 +c 3 \\y t \\l>M N } \ 

i=l 

AT 

+ C 3 ' 

i=l 

For any < e < 1/2, we employ Holder's inquality 

1 1 ( N 
< — : — < Ci 



cla log JV 



X^[ci+c 2 ||yi||2 + call^Hl >Mjv] 



+ 



ea5^E[(|| w || a )«] P[ci + C2||Vi||2+ca||»||2 > Mat] ' 



i=l 
N 



+ C 3 



X) K [(llwlla)*]*P[ci + Oalll«lla+cs||»||a > Af, 



Since all moments of the non-central x -distribution are finite, we get 



1 1 

CE4 log JV 



2\ 

ci^p[ Ci + ca||y;||a +c 3 \\yif 2 > Mat] 



+ c 2 ^P[ci +c 2 ||yi||a + ca\\y % \\l > 
i=l 

AT 

+ C3^p[ci + ca||yi|| 2 + c 3 ||yi||i > M 



With (|A.5|I we finally obtain 

AT AT 



< 



2 1 

a4 log JV 



_p L_ 

log JV N 1 -^ 



□ 



Now, we have shown that the probability (IA.2I) in Lemma [5] is large. Defining 
the set J by 



sup 



V N (cf>) - V N (<t> ) 



(pT, v n£* (II/ 3 ~ /3o II i + || r? - ^olb) V A 



< TA I (A.6) 



means that J has large probability. The rest of the proof of Theorem 1 is as in 
Stadler et al.l (l20inh . 
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Appendix B: Proof of Theorem 2 



The proof of the theorem co mprises two main pa rts. First, we have to show that 
three conditions presented in IStadler et al. I (|2010h are fulfilled. Afterwards we can 
present the proof of the theorem. 



Appendix Bl: Verification of the conditions 



We have to check Conditions 1 — 3 in Stadler ct al. (2010). Subsequently, each of 
these is stated as a Lemma and again for simplicity, we drop the index i. 
Le t us introduce a slig htly different parametrization, which coincides with the one 
in IStadler et al.l ( 20101 ) and which simplifies the proofs below. For Xk & W,k = 
1, . . . , n, define X T = [x\, . . . , x n ). Let 

tf T = ti(X) T = (xl(3, . . . , x T n (3, 6 T , 2 log fj) = ((X/3) T , 9 T , g) 
= ((X(3) T , V T ) = = «W) G R d 



be the parameter vector with dimension d 
space is bounded by the constant K: 



q* + 1. By (12), the parameter 



C {0e 



< K, * > 0} 



where ||#||oo := maxi<j<d |i9 3 -|. Let £ ©} be the Gaussian density of 

y and i&(y) its log- likelihood function. Moreover, let #o be the true parameter 
vector. 



Lemma 4 Under Assumption 1 holds 



sup max 
■oe® (ji>j2,j3)e{i,---,<i} 3 



(9 3 



e*(y) <G 2 ( y ), 



where 



sup / G 2 (y)U (y)dn(y) < C 2 < oo. 
xex n J 



Proof. Set G 2 (y) '■= cii H- 1 i 2/ 1 1 2 H- <^3 1 1 2/ 1 i i for appropriate constants di, da, da € K+. 
The proof makes use of the same techniques as the proof of LemmaQ]in the Appendix 
A. □ 

Lemma 5 Under Assumption 2 (b), the Fisher information matrix T(£(X) , rj) is 
strictly positive definite. 

Proof. Fo r v ~ Af„ (j- . V) with V = Z ^Z T + e e I, the Fisher information matrix is 
given by (Mc Culloch and SearleL l200lh 



V 1 

UtriV-^V- 



2l L1 \ y d#r c» s )Sr,3=n+l 



The upper left part of the matrix is given by V -1 , which is positive definite. By 
Assumption 2 (b), the lower right part is also positive definite, hence we get the 
claim. □ 



Lemma 6 Under Assumption 2 (a), for all e > 0, there exists an a e > 0, such that 

inf inf £(0(X)\<& o (X)) > a e . 

xe#"0e©,||0--fl o || 2 >e 
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Proof. Let ■d T = (£ T ,J7 T ), i?o = (£oi r lo)i V an d Vo the corresponding covariance 
matrices. Then log/* (y) - log^(y) = \ log |V| - \ log |V | + \{y - t) T V~ x {y - 



- |(y - to) T WHv - Co). Since £(0|0 O ) := E„ [ log /* (y) - log/*(y) 
follows 



, it 



log M + tr(v -1 vfo) + (Co - Cf^CCo - C) - 

I V | 



By definition of the excess risk £(i9|i9q) > 0. Denote rj T = (6 T , g) and 77^ 



(0q , go) j then we can detail: 



lot 



|Vb| 



^ log 

i=i 



tr(V- 1 V )=X; 



_1 



Thus, we get 



i ~\ n ( 

m#o) = 2 {(6 - o T v- i (io - o} + 2 E 



+ e eo 



3=1 



log 



The first term is strictly positive if Co 7^ C an d zero iff Co = C- The second term is 
a function of the form u — log(u) — 1 > for u > 0. The second term is only zero if 
all terms are exactly zero. Due to Assumption 2 (a), we get the claim. □ 

Appendix B2: Main proof of Theorem 2 

Let us write || W/3||i := J2k=i w k\Pk\- Using the definition of cf>, and on J, we have 
the basic inequality 



£(4#o) + A||W73||i <TA C 



(||/3-A)||i + ||»?-»Jo||2) VA 



M\Wf3 \ 



Invoking the triangle inequality ||W/3o||i — ||W/3s ||i < ||W(/3g — /3 )||i, we 
obtain 



e(<f>\<M + \\\wpsg\\i<TXo 



(||/3-/3o||i + ||t7-t7o|| 2 ) VA 



A||W(/3 So -/3o)||i. 



Since iVk>l/5 for k € Sq and Wk < 1/S for {; £ 5q, we arrive at 



£(</#o)+A/S||/3 sg ||i <TA 



(||/3-/3o||i + l!r)-r7o|| 2 )VA 



+A/5||/3 So -/3o||i. (B.l) 



By the arguments in lStadler et al. for a constant cq independent of N, 

n, p and the design, 



£(0|4>o) > G§ - /3o) T S iV ,„(/3 - A))/cg + H - r7 ||l/ c 2. 
Case 1 Suppose that 

||/3 -/3 ||i + ||*7 -*7o|| 2 < A . 



(B.2) 



Then we find from (|B.1|1 . 

£(4>\<M < £(4>x\ct>o) + X/S\\$s S \\i < TXl + X/6\\$ So - Po\\i < TXl + X/S0 - f3 h 
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and hence 

£($\<M + 2A/5II/3 - A)||i < T\l + 3X/50 - /3 ||i 

< (3A/5 + TA )A . 

Case 2 Suppose that 

||/3-/3 ||i + ||^-r?o||2> A , 

and that 

TAo||7)-T7o|| 2 > (X/S + TX )\\$ So -ft Hi- 
Then we get, adding (X/5 — XoT)\\(3s — /3o||i to left- and right-hand side of (|B.1|) . 

£ (0|0o) + (A/5 - TX )\\$ - /3 ||i < (A/5 + TA )||r7 - r,o|| 2 

< (A/5 + TA ) 2 C 2/2 + ||77 - r7o||l/(2 C 2) 

< (X/5 + TX fc%/2 + £(t\(l> )/2, 



where in the last inequality, we applied (|B.2|) . So then 

£(j>\<f> ) + 2(X/S - TX )\\P - #,|| i < (A/5 + TX ) 2 c 2 . 
Case 3 Suppose that 

\\$-Poh + \\fi-voh> Aq, 

and that 

TA ||T7-r7o|| 2 < (A/5 + TA )||/3 So - /3 ||i. 

Then we have 

£ (0|0o) + (A/5 - TAo)||/3sfg||i < 2(A/5 + TX )\\$ So - Mi- ( B -3) 
Because Ao < A/ (25), inequality (|B.3() implies 

ll&fglli <6||/3 So -A>||i. 

We can therefore apply the restricted eigenvalue condition to — (3$. But first, add 
(A/5 — XqT)\\0Sq — /3o 1 1 1 to the left- and right-hand side of (|B.3|) . The restricted 
eigenvalue condition now gives (invoking 2(A/5 + TAo) + (A5 — TXq) < 3(A/5 + TAo)) 

5(0|0o) + (A/5-rA o )||/3-/3o||i 
< 3(A/5 + TA )^ ||/3 So -/3o|| 2 



< 3(A/5 + TX )^\/((3 - /3o) T Siv,„(/3 - A>) 

< 9(A + TA o ) 2 c2 K 2 So /2 + £(^|0o)/2, 
applying again (IB.2|) in the last step. So we arrive at 

£(4>\ct>o) + 2(A/5 - TX Q )\\/3 - o \\i < 9(A/5 + TX Q f(%K 2 s . 

Appendix B3: Proof of Corollary 3 

For the estimator in (13) we have: 

WPinit - A)||oo < 11/3 - A)||i < ^nit- 
Consider fc € S*o with |/?o.fe| > 5mif. Then it must hold that ^ (since otherwise, 
if /9fc were equal to zero, ||/3 - /3o||oo < \Pk - Po,k\ = \Po,k\ > $init which is a 
contradiction to the (.^ -estimation error bound). The argument for the adaptive 
l\ -penalized estimator (14) is analogous. 
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Appendix C: Computational details of Algorithm 1 



(0): Initial value <p° . As a starting value for f3, we choose an ordinary Lasso so- 
lution by cross-validation ignoring the grouping structure among the observations. 
By doing so, we ensure that we are at least as good (with respect to the objective 
function) as an ordinary Lasso in a linear model. The calculation of the starting 
value for 9 depends on the specific structure of and may be performed as in the 
(Gauss-Seidel) iteration. The point we would like to make is that those elements 
that are estimated as zero in <p° may escape from zero and non- vanishing elements 
of 4>° can be set to zero during Algorithm 1. 

(1): Choice of h . For numerical convergence (see Theorem 3), we require that 
h l is positive and bounde d. We use the diagonal elements of the Fisher information 
1(0) and, as proposed in Tseng and Yunl ([2009), for constants c m i n and 



Cmax we 

set h e = mm(max(2(d>) s i s e , c min ) , c max ) with c mm = 10~ 6 and c max = 10 s in the 
R package lmmlasso. 

(2): Calculation of d . We have to distinguish whether the index S appears in 
P(4>) or not: 

.n,,!ian.( ^ ^ I S e e{l,...,p}, (C1) 



else. 



(3): Choice of or. The step length or is chosen in such a way that in each step, 
there is an improvement in the objective function Q\(.). We use the Armijo rule 
which is defined as follows: 

Choose oq nit > and let a e be the largest element of {af nit S r } r= o,i,2,.. satisfying 

Q A (0 £ + oVe s < Qxitf) + a e gA e , 
where A e := 8 / ' d^g^d 1 + j(d e ) 2 h e + \P(<j) e + d e e s t) - XP^). 



The choice of the constants comply with the suggestions in lBertsekas (1999) and 
are 5 = 0.1, g = 0.001, 7 = and a e mlt = 1 for all £. 

Simplification of (2) and (3) for the (3 '-parameter. lfl{(f)) s t s i is not truncated, 
we take advantage of the fact that g(<f>) is quadratic with respect to [3. Using 
oq nit = 1, the stepsize a 1 chosen by the Armijo rule (r = 0) leads to the minimum 
of g{4> 1 ) with respect to f3 s t.. The update /?^ 1 (A) is then given analytically by 

f (\(y-y) T V-i Xse \-\) 
^Y(X) = sign ((„ - vfV-^p x T eV -i Xse — ' ( C - 2 ) 

where X = (x\, . . . , x p ), y = X^ s ^0i s e\ (leaving out the 5^th variable), (.) + = 
max(.,0) and sign(.) the signum function. 

Most often, I(<p) s t s e is not truncated and hence the analytical formula (IC2|) can 
be used. This simplification reduces the computational cost remarkably, especially 
in the high-dimensional setup. 

Parametrization of ^ . We parametrize ^ by a set of unconstrained parameters 
6. A discussion how to parametrize a positive definite variance-covariance matrix 
by an unconstrained set of pa rameters can be found in iPinheiro and Bated (|2000l) 



and Pinheiro and Bates! dlflflfih . In the current version of the lmmlasso package we 
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employ the Cholesky decomposition = LL T where 6 corresponds to the lower 
triangular elements of L. 



Choice of the X-sequence. We choose a Ai sufficiently large such that all penalized 
coefficients are zero. We calculate a sequence Ai > A2 > . . . on a log-scale until a 
model with a certain sparsity level is reached. At latest, we stop if the number of 
selected fixed-effects variables is larger than the total number of observations. The 
optimal A is then chosen by 

X opt = arg min BIC\ k . 

k>l 

Active-Set Algorithm. Assuming that the solution is sparse, we can reduce the 
com puting time by u s ing an active-set algorithm, which is used in I Meier et al. 



and iFriedman et al. I (120101) . More specifically, we do not cycle through all coor 



dinates, but we restrict ourselves to the current active set S(f3) and update all 
coordinates of (3 only every Dth iteration. This reduces the computational time 
considerably. 

Proof o f Theorem 3. Fo r the p recise definition of cluster and stationary point 



we refer to iTseng and Yunl (|2009f ). It remains to check that the assumptions in 
Tseng and Yunl (2009) are fulfilled. More precisely: A > 0, P{.) = |.|i is a proper, 



convex, continuous function and block-separable with respect to S , g(.) is contin- 
uously differentiable on dom(P) = {<p\P{<p) < 00}, c m i n < h e < c max for i > and 
< c min < c m ax- Moreover, sup £ a 1 > and inf^ a\ nit > 0. 



Appendix D: Simulation study for the low-dimensional 
setting 

In this setting, we will compare Imm Lasso and ImmadLa s so wit h the classical linear 
mixed-effects framework (Ime) from Pinheiro and Bates] (|2000l) and both the Lasso 



and the adaptive Lasso. The optimal model for the Ime procedure is determined 
by backward elimination. 

The two examples are chosen in the following way (/?o,i = 1 is the unpenalized 
intercept): 

W. N = 25, n = 6, N T = 150, p = 10, q = 3, a 2 = 0.25, 9 2 = 0.56 and s = 5 
with f3 = (1,2,4,3,3,0,...,0) T . 

L 2 : N = 30, n = 6, N T = 180, p = 15, q = 3, a 2 = 0.25, 



and s = 5 with (3 a = (1, 2, 4, 3, 3, 0, ... , 0) T . 

The results in the form of means and standard deviations (in parentheses) over 
100 simulation runs are reported in Table [71 [5] and Therein, |5(/3)| denotes the 
cardinality of the estimated active set and TP is the number of true positives. We 
would like to emphasize that we do not penalize any covariate having a random- 
effects coefficient (indicated by an asterisk *). 

We see from the tables that the estimated average active set is sparse and only 
slightly larger than the cardinality of the true active set So = S((3 n ). This property 
might be expected because it is known from linear regression that the BIC selects a 
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Table 7: Comparison of ImmLasso, ImmadLasso, lme. Lasso and adLasso for model 
Li 



Method 


\S($)\ 


TP 


a 2 


6> 2 


ft 


ft 


ft 


ft 


ft 


true 


5 


5 


0.25 


0.56 


1 


2 


4 


3 


3 


ImmLasso 


5.94 


5 


0.24 


0.55 


0.99* 


2.01* 


4.03* 


2.94 


2.95 




(1.04) 


(0) 


(0.04) 


(0.11) 


(0.14) 


(0.15) 


(0.15) 


(0.06) 


(0.06) 


ImmadLasso 


5.11 


5 


0.24 


0.55 


0.99* 


2.01* 


4.02* 


2.99 


3 




(0.31) 


(0) 


(0.04) 


(0.11) 


(0.14) 


(0.15) 


(0.15) 


(0.05) 


(0.06) 


lme 


5.14 


5 


0.24 


0.55 


0.99* 


2.01* 


4.02* 


2.99* 


3* 




(0.35) 


(0) 


(0.04) 


(0.11) 


(0.14) 


(0.15) 


(0.15) 


(0.05) 


(0.06) 


Lasso 


5.54 


5 


1.85 




1.00* 


1.99* 


4.04* 


2.88 


2.89 




(0.69) 


(0) 


(0.38) 




(0.16) 


(0.20) 


(0.18) 


(0.12) 


(0.12) 


adLasso 


5.54 


5 


1.81 




1.00* 


1.99* 


4.01* 


2.99 


3.00 




(0.69) 


(0) 


(0.37) 




(0.16) 


(0.20) 


(0.18) 


(0.11) 


(0.11) 



* indicates that the corresponding fixed-effects coefficient is not subject to penalization 



Table 8: Comparison of ImmLasso, ImmadLasso, lme, Lasso and adLasso for model 
L 2 



Method 


\S(P)\ 


TP 


a 2 


ft 


ft 


ft 


ft 


ft 


true 


5 


5 


0.25 


1 


2 


4 


3 


3 


ImmLasso 


7.33 


5 


0.24 


1.00* 


1.96* 


3.99* 


2.95 


2.94 




(1.54) 


(0) 


(0.04) 


(0.42) 


(0.24) 


(0.18) 


(0.05) 


(0.06) 


ImmadLasso 


5.31 


5 


0.24 


1.00* 


1.96* 


3.98* 


3 


2.99 




(0.72) 


(0) 


(0.04) 


(0.42) 


(0.24) 


(0.18) 


(0.05) 


(0.06) 


lme 


4.85 


4.75 


0.24 


0.73* 


1.86* 


3.95* 


3* 


2.99* 




(0.75) 


(0.5) 


(0.04) 


(0.67) 


(0.32) 


(0.19) 


(0.05) 


(0.06) 


Lasso 


5.59 


5 


8.43 


1.00* 


1.92* 


4.05* 


2.72 


2.68 




(1.02) 


(0) 


(2.27) 


(0.44) 


(0.38) 


(0.29) 


(0.27) 


(0.24) 


adLasso 


5.59 


5 


8.23 


1.00* 


1.92* 


3.98* 


2.99 


2.94 




(1.02) 


(0) 


(2.21) 


(0.44) 


(0.37) 


(0.30) 


(0.26) 


(0.23) 



* indicates that the corresponding fixed-effects coefficient is not subject to penalization 



Table 9: Covariance estimates of ImmLasso and lme for L<± 



Method 


*ii 


*12 


*13 


*22 


*23 


*33 


true 


5 


2 


0.5 


2 


1 


1 


ImmLasso 


4.83 


1.95 


0.58 


1.91 


1.03 


1.04 




(1.26) 


(0.76) 


(0.51) 


(0.58) 


(0.38) 


(0.32) 


ImmadLasso 


4.84 


1.95 


0.58 


1.92 


1.04 


1.04 




(1.26) 


(0.76) 


(0.51) 


(0.58) 


(0.39) 


(0.32) 


lme 


5.03 


2.01 


0.6 


1.94 


1.04 


1.04 




(1.43) 


(0.87) 


(0.53) 


(0.60) 


(0.39) 


(0.33) 



sparse model. All methods except lme in model L2 are including the true non-zero 
coefficients in the active set. Concerning variance components, we clearly see that 
the estimated error variance of Lasso and adLasso can be reduced and split into the 
within and between-subject variability by ImmLasso and ImmadLasso, respectively. 
The tables show that the penalized fixed-effects coefficients from ImmLasso have 
a bias. However, it is smaller than that of the corresponding coefficients from the 
Lasso. By using ImmadLasso, we can attenuate the bias problem. From Table |H] 
we note that the variance component estimates of St are underestimated compared 
to the results from lme. However, by a closer look, the fixed effects of lme have a 
larger bias in lme than in ImmLasso and ImmadLasso. It seems that lme estimates 
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the variance components more precisely while underestimating the corresponding 
fixed effects. Finally, it must be recognized that the backward selection used for 
lme regularly breaks down due to convergence problems within the R- function lme. 
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